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Abstract 

A  nonparametric  Bayesian  model  is  proposed  for  segmenting  time-evolving  mul¬ 
tivariate  spatial  point  process  data.  An  inhomogeneous  Poisson  process  is  assumed, 
with  a  logistic  stick-breaking  process  (LSBP)  used  to  encourage  piecewise- const  ant 
spatial  Poisson  intensities.  The  LSBP  explicitly  favors  spatially  contiguous  segments, 
and  infers  the  number  of  segments  based  on  the  observed  data.  The  temporal  dynam¬ 
ics  of  the  segmentation  and  of  the  Poisson  intensities  is  modeled  with  exponential 
correlation  in  time,  implemented  in  the  form  of  a  first-order  autoregressive  model 
for  uniformly  sampled  discrete  data,  and  via  a  Gaussian  process  with  an  exponential 
kernel  for  general  temporal  sampling.  We  consider  and  compare  two  different  in¬ 
ference  techniques:  a  Markov  chain  Monte  Carlo  sampler,  which  has  relatively  high 
computational  complexity;  and  an  approximate  and  efficient  variational  Bayesian 
analysis.  The  model  is  demonstrated  with  a  simulated  example  and  a  real  example 
of  space-time  crime  events  in  Cincinnati,  OH,  USA. 

Keywords:  Bayesian  hierarchical  model,  spatial  segmentation,  temporal  dynam¬ 
ics,  Gaussian  process,  logistic  stick  breaking  process,  inhomogeneous  Poisson  process 
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1  Introduction 


1.1  Motivating  application 

Assume  access  to  the  locations  of  various  types  of  crimes  occurring  in  a  given  city, 
as  a  function  of  time.  As  a  motivating  example,  in  Figure  1(a)  data  are  shown 
for  3090  crimes  (of  17  crime  types)  in  Cincinnati  in  Jan  2008.  Our  focus  is  on 
obtaining  a  spatial  segmentation,  such  as  that  shown  in  Figure  1(b).  In  addition  to 
the  spatial  dependence  of  point  process  data,  we  wish  to  simultaneously  explore  time 
dynamics.  For  example,  in  the  crime  data  analysis,  the  crime  intensity  in  summer 
may  be  different  statistically  from  that  in  winter,  and  this  intensity  may  change 
smoothly  over  seasons;  consequently,  the  spatial  segmentation  of  the  city  may  also 
vary  smoothly  over  time. 

The  analysis  of  time  dynamics  helps  to  discover  the  temporal  pattern  of  the 
events  and  to  predict  the  spatial  segmentation  at  an  unobserved  time  instance  or 
in  the  future.  We  desire  that  the  analysis  provide  a  simple  summary  that  is  useful 
to  police  forces  and  city  planners  in  targeting  resources,  as  well  as  to  researchers  in 
studying  crime  trends.  We  would  like  to  obtain  this  space-time  segmentation  quickly, 
utilizing  data  from  different  types  of  events,  while  allowing  temporal  interpolation 
and  forecasting. 

1.2  Summary  of  proposed  model 

Consider  the  data  V  =  {s«,  t=i,...,T,  where  Vu  is  a  d-dimensional  vector 

of  the  counts  of  d  types  of  events,  occurring  in  (small)  spatial  region  A(st),  with 
the  center  of  the  region  being  Sj  6  M2;  in  the  context  of  Figure  1,  we  are  interested 
in  d  types  of  crime.  The  contiguous  grid  of  spatial  regions  A(-)  is  fixed  in  advance, 
and  the  size  of  A(-)  is  very  small  relative  to  the  size  of  the  entire  spatial  domain, 
providing  justification  for  an  approximation  in  which  we  index  regions  by  the  center 
point  and  assume  homogeneity  within  regions  (using  the  model  developed  below,  in 
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(a)  Crime  events  in  Cincinnati  during  Jan.,  (b)  Segmentation  of  Cincinnati 

2008 


Figure  1:  Crime  events  and  the  segmentation  of  the  city.  In  (a)  3090  crime  events  are  shown  as 
black  dots;  in  (b)  each  color  indexes  a  segment  with  associated  crime  intensities  in  17  crime  types 
(see  result  section  for  details). 


the  limit  A^Owe  have  a  Poisson  process) .  There  are  T  time  points  at  which  data 
are  observed,  not  necessarily  uniformly  spaced  in  time.  Although  not  done  here,  one 
may  envision  aligning  the  grid  A(-)  with  the  geometry  of  the  terrain  (e.g.,  roads). 
The  proposed  space-time  model  may  be  summarized  as 

d  K 

Vit  ~  JI  Poisson(Ajit),  Xlt  ~  ^  wk(su  Gkt)Sx*t  (1) 

j= 1  k= 1 

where  wk(si ;  6kt)  >  0,  Ylk=i  wk(si ;  Gkt)  =  1  f°r  all  si ,  3\*kt  is  a  unit  measure  concen¬ 
trated  at  X*kt,  and  Xljt  is  the  j th  component  of  X,t .  This  corresponds  to  a  mixture 
model,  with  space-time  varying  mixture  weights  wk(si ;  Gkt)  and  time-varying  atoms 

Akf 

Expression  wk{s\6kt )  represents  a  general  parametric  function  capable  of  mod¬ 
eling  the  probability  of  cluster  k  at  spatial  location  s.  In  the  details  of  the  proposed 
model,  one  of  the  {wk(s]  Okt)}k=i,K  is  likely  to  be  dominant  (large  probability)  over 
a  contiguous  region,  yielding  a  segmentation.  Since  the  parameters  0kt  change  in 
general  with  time  t,  a  probabilistic  space-time  segmentation  is  manifested.  Within 
the  proposed  model,  the  prior  encourages  that  {0kt}  and  A *kt  vary  smoothly  as  a 
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function  of  time,  and  hence  the  model  imposes  smooth  space-time  variation  in  the 
shape/form  of  the  segments,  and  smooth  temporal  variation  of  the  Poisson  rates 
associated  with  a  given  segment. 

Two  methods  are  considered  for  imposing  temporal  smoothness,  representing 
two  perspectives  on  imposing  the  same  temporal  structure.  For  discrete-time  data 
with  uniform  temporal  spacing,  it  is  natural  to  consider  the  first-order  autoregressive 
model,  i.e.,  AR(1),  as  0kpt  ~  N{C@kp(t-i)i  ao 1)>  with  @kpt  the  pth  component  of  Oj-t, 
(,  the  AR(1)  coefficient  (with  |£|  <  1),  and  o:0  a  precision  to  be  inferred  (£  and  a0 
could  also  be  extended  to  depend  on  k  and  p).  The  log  of  each  component  of  \*kt 
may  be  similarly  modeled. 

We  also  consider  a  Gaussian  process  (GP)  model  Rasmussen  and  Wiliams  (2006) 
in  time  for  each  component  6tpt ,  and  for  the  log  of  each  component  of  A(;t,  this  allow¬ 
ing  non-uniform  temporal  sampling.  To  make  the  AR(1)  and  GP  models  consistent, 
we  assume  an  exponential  model  for  the  GP  covariance  between  times  tt  and  0, 
cocf*-*'1,  with  Ci  playing  a  role  analogous  to  (,  in  the  AR(1)  model,  and  the  variance 
Co  corresponds  to  [(1  —  £2)a;o]_1  from  the  AR(1)  model.  The  AR(1)  and  chosen 
GP  representations  are  therefore  essentially  different  means  of  imposing  the  same 
temporal  prior,  with  the  former  restricted  to  uniform  temporal  sampling. 

In  addition  to  developing  a  new  model  for  multivariate  inhomogeneous  space- 
time  Poisson  process  data,  a  contribution  of  this  paper  concerns  computations,  in 
the  form  of  a  detailed  comparison  of  Markov  chain  Monte  Carlo  (MCMC)  and  vari¬ 
ational  Bayesian  (VB)  inference  for  this  class  of  models.  The  former  is  widely  used, 
but  it  can  be  computationally  prohibitive  for  the  motivating  large-scale  problems 
considered  here.  Computations  based  on  VB  are  attractive  for  large-scale  modeling 
studies,  but  many  simplifying  assumptions  must  be  made. 
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1.3  Related  research 


A  natural  model  for  exploiting  spatial  information,  and  to  model  point  process  data, 
is  the  inhomogeneous  Poisson  process  Diggle  (2003);  Moller  and  Waagepetersen 
(2004).  Researchers  have  recently  studied  nonparametric  Bayesian  approaches  for 
such  applications.  One  of  these  approaches  models  the  Poisson  intensity  function  by 
a  variation  of  a  Gaussian  process  (GP)  Adams  et  al.  (2009);  Rathbun  and  Cressie 
(1994);  Mpller  et  al.  (1998).  The  log-Gaussian  Cox  process  Mpller  et  al.  (1998), 
corresponding  to  an  intensity  function  modeled  as  an  exponentiated  GP,  has  proven 
highly  successful  in  point  process  Hossain  and  Lawson  (2009)  and  geostatistical 
modeling  Diggle  et  al.  (2010);  Pati  et  al.  (2010).  Mixture  models  provide  another 
approach  to  representing  the  Poisson  intensity  function  Wolpert  and  Ickstadt  (1998). 
Kottas  and  Sanso  (2007)  proposed  a  Dirichlet  process  (DP)  mixture  model  of  bi¬ 
variate  beta  densities  to  model  heterogeneity  in  intensity  function.  Dirichlet  process 
mixture  models  of  multivariate  normal  densities  can  be  also  found  in  Ji  et  al.  (2009); 
Chakraborty  and  Gelfand  (2010). 

In  Taddy  (2008,  2010);  Taddy  and  Kottas  (2012)  a  dynamic  model  was  proposed 
for  Poisson  point  processes,  based  on  a  novel  version  of  the  dependent  Dirichlet 
process.  Models  of  this  type  have  been  applied  to  the  data  considered  in  Figure  1, 
although  the  problem  of  segmentation  was  not  considered.  In  Achcar  et  al.  (2011) 
a  time  inhomogeneous  Poisson  model  was  proposed,  with  change-points  to  estimate 
the  number  of  times  that  a  given  environmental  standard  is  violated  in  a  time 
interval  of  interest. 

Rather  than  modeling  the  Poisson  intensity  via  a  GP  or  a  DP  mixture  model, 
the  model  in  (1)  constitutes  a  mixture  model  with  space-time  mixture  weights, 
and  the  spatial  locations  {s,}  of  the  grid  are  modeled  as  covariates.  The  details 
of  how  Wk(s]6kt )  is  modeled  encourages  contiguous  regions  in  space  and  time  for 
which  a  single  component  (cluster)  dominates,  encouraging  a  piecewise-constant 
Poisson  intensity  function.  In  Heikkinen  and  Arjas  (1998)  the  authors  similarly 
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build  a  piecewise  constant  prior  model  for  spatial  Poisson  intensities,  using  Voronoi 
tessellations.  We  model  Wk{s;0kt )  via  an  extension  of  the  logistic  stick-breaking 
process  (LSBP)  Ren  et  al.  (2011).  The  region  of  interest  is  partitioned  into  a  set  of 
contiguous  small  square  cells,  with  related  ideas  considered  in  Hossain  and  Lawson 
(2009).  Within  the  context  of  the  aforementioned  GP  construction  for  the  temporal 
dependence  of  0kt,  related  ideas  were  presented  in  the  context  of  factor  analysis 
Luttinen  and  Ilin  (2009),  where  GPs  were  used  to  describe  the  smoothness  of  both 
spatial  locations  and  time.  An  AR  model  for  temporal  dynamics  was  considered  in 
Taddy  (2008,  2010). 

2  Model  Details 

2.1  Basic  construction 

The  proposed  space-time  model  for  data  V  =  {s*,  ■Uit}j=i,...,M,t=i,...,T  is  summarized 
as 

d  K 

vit  ~  n  Poisson(Ajjt),  wk(sit)Sxlt  (2) 

j— 1  k=l 

k- 1 

Wk(sit)  =  Pk(sit)  JJ[1  -  Ph(sit)]  (3) 

h= 1 

Pk{sit)  =  cr(gk(sit)),  for  k  =  1, ...,  K  -  1,  pK(sit)  =  1  (4) 

j 

9k(.Sit)  Pkjt^ipit  •:  Sj ,  ^k)  T  PkOt  (5) 

3= 1 

where  (2)  is  repeated  here  from  (1),  for  convenience.  Below  we  explain  and  motivate 
each  term  in  this  construction.  Parameters  Okt  from  the  Introduction  correspond 
here  to  {(3kjt}j= o,j  and  by.  In  what  follows,  the  notation  sit  is  meant  to  assign  statis¬ 
tics  to  spatial  location  s,  at  time  t\  for  example,  wk{sn )  is  the  fcth  mixture  weight 
as  observed  at  st  and  time  t.  The  spatial  grid  defining  the  regions  (A(s*)}j=ka/  is 
not  changing  with  time. 
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The  expression  in  (3),  with  Pk(sit )  G  [0, 1]  for  all  sit,  is  suggestive  of  the  stick¬ 
breaking  representation  of  the  Dirichlet  process  Sethuraman  (1994).  The  function 
a(x)  =  exp(a:)/(l  +  exp(a:))  is  associated  with  a  logistic  model,  and  PK(sit)  =  1 
such  that  Ylk=iwk(sit)  =  1  for  all  sit ■  By  the  construction  of  gk{sit)  in  (5),  the 
probabilities  Pk(su)  have  space-time  variation,  with  such  variation  transferred  to 
the  mixture  weights  tty(sjt)  via  (3).  Therefore,  via  mixture  weights  uy(sjt)  in  (2) 
we  constitute  a  multivariate  Poisson  mixture  model,  with  weights  that  vary  as  a 
function  of  sit. 

Function  JC(s,  Sj]  ipk)  denotes  a  kernel  with  parameter  by.  Here  we  employ  the 
radial  basis  function  IC(s,  Sj-,xpk)  =  exp(— ||s  —  Sj |||/by),  with  J  predefined  kernel 
centers  {s3}3=iyj\  for  convenience  these  J  centers  are  here  aligned  with  the  centers 
of  the  spatial  grid  defined  by  A(s,-)  (recall  discussion  in  the  Introduction).  The  ap¬ 
propriate  kernel  parameters  {by}  will  be  inferred.  To  ease  computations,  we  assume 
a  discrete  set  of  parameters  {xpl, . . .  ,  ip}  }  over  which  a  uniform  prior  is  placed;  each 
kernel  parameter  by  is  assumed  drawn  from  this  finite  library  of  parameters. 

The  space-time  dependence  of  the  model  is  manifested  in  how  { fikjt  }j=o,J  arid 
{A £t}  are  modeled. 

2.2  Temporal  modeling 

When  the  data  are  sampled  uniformly  in  time,  an  autoregressive  (AR)  temporal 
model  is  natural.  Specifically,  we  consider 

Pkjt  ~  1),  a^1)  ,  j  =  0, . . . ,  J  (6) 

lo§ Kjt  ~ -A/"^ loS Kj{t- 1)» «a x) ,  j  =  (7) 

with  [3kjo  =  log  Ajy0  =  0.  Gamma  priors  are  placed  on  ay  and  a.\.  Further,  (  and  p 
are  drawn  from  a  truncated  normal  A/(o,i)  (0, 1)  with  0  <  C,S  <  1- 

The  collection  of  data  may  be  expensive,  and  there  may  be  situations  for  which 
nonuniform  temporal  sampling  is  desired  (e.g.,  to  provide  fine-scale  sampling  in 
particular  regions  -  seasons  -  of  time  that  may  be  interesting) .  This  suggests  using 
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a  Gaussian  process  (GP)  model  Rasmussen  and  Wiliams  (2006)  for  the  temporal 
variation  of  fikjt  and  logA£  -t. 

For  the  fcth  mixture  component,  we  let 

j 

Bk  ~  Af(Bk\0,  Qk)  =  JJ.A/*(/3fc_/:|0,  [^kj\u  =  c0c^tl~tl^  (8) 

3=  0 

where  f3kj:  =  [f3kji,  ■■■,  /3kjT]T ,  and  Bk  e  R^-7*1)  denotes  a  vector  formed  by  con¬ 
catenating  f3kj:  for  j  =  0, J.  The  covariance  flk  is  a  block-diagonal  matrix  of  size 
T(J+  1)  xT(J  +  l),  and  each  block  Hkj  is  a  T  xT  covariance  matrix;  the  entry  at 
row  i  and  column  l,  denoted  as  is  evaluated  using  the  GP  covariance  function 

with  the  hyperparameters  {co,ci}.  A  gamma  prior  is  placed  on  cq .  Since  c\  plays 
the  same  role  with  £,  we  also  draw  c0  from  the  truncated  normal  A/(o,i)(0, 1)  with 

0  <  ci  <  1. 

The  Gaussian  process  priors  are  also  placed  on  log  Xl;]t.  For  mixture  component 

k 


log(Ay  -  W(0,  Tkj),  [Tkj]u  =  d0d ^  (9) 

where  log(A£-:)  =  [log(A^jl), log(A^-r)]r,  and  the  covariance  matrix  Tkj  G  MTxT, 
with  the  entries  defined  by  the  GP  covariance  function  with  the  hyperparameters 
{doi^i}-  A  gamma  prior  and  truncated  normal  prior  are  placed  on  d0  and  (h-  As 
discussed  in  the  Introduction,  the  considered  AR(1)  and  GP  priors  are  consistent, 
and  provide  different  modeling  strategies  for  the  same  imposed  temporal  dynamics. 

2.3  Model  interpretation 

Equations  (3)- (5)  are  of  the  form  of  the  logistic  stick-breaking  process  (LSBP)  intro¬ 
duced  in  Ren  et  al.  (2011);  however,  that  paper  did  not  consider  Poisson  data,  and 
space-time  processes  were  not  addressed.  Recall  that  a(x)  ~  1  for  x  >  4;  we  refer  to 
this  as  the  “clipping”  property  of  the  logistic,  as  all  x  larger  than  about  4  contribute 
effectively  in  the  same  manner  to  a(x);  one  may  alternatively  use  a  probit  model,  to 
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achieve  the  same  end.  If  fikjt  >  4,  then  Pk(s)  ~  1  for  ||s  —  Sj |||  <  xpk-  This  implies 
via  (3)  that  within  region  ||s  —  Sj\\%  <  ?/y.,  if  flkjt  >  4  mixture  component  k  is  highly 
probable  (assuming  that  other  clusters  k!  ^  k  do  not  have  large  py(s )  in  the  vicinity 
of  Sj).  The  “clipping”  nature  of  the  logistic  function,  and  large  values  of  Bkjt  >  4, 
encourage  contiguous  regions  for  which  a  given  cluster  k  has  high  space-time  prob¬ 
ability  of  being  manifested  (all  locations  s  at  which  gy(s)  >  4  have  similarly  high 
probability  of  being  associated  with  cluster  k ,  regardless  of  the  exact  value  of  gk{s)). 
The  weights  {0kjt}  play  the  role  of  assigning  which  regions  in  space-time  are  most 
likely  to  be  associated  with  a  given  cluster  k,  and  by  defines  the  size  scale  of  the 
cluster.  Note  that  while  we  truncate  the  model  to  K  mixture  components,  this  does 
not  mean  that  all  components  need  actually  be  used  to  represent  the  data.  For 
example,  if  a  given  /3kot  is  large  and  negative,  then  the  kth  mixture  component  is 
unlikely  to  be  utilized  at  all  spatial  locations  at  time  t;  K  is  simply  an  upper  bound 
on  the  number  of  mixture  components  (segment  types). 

3  Posterior  inference 

The  posterior  distribution  of  the  model  parameters  is  inferred  via  an  MCMC  sampler 
and  via  variational  Bayesian  (VB)  inference  Beal  (2003).  The  VB  inference  typically 
converges  fast  and  is  computationally  efficient;  by  contrast,  MCMC  convergence 
may  be  difficult  to  diagnose,  and  a  large  number  of  iterations  are  required  to  collect 
samples  representing  the  joint  posterior  distribution.  The  detailed  MCMC  and  VB 
update  equations  are  provided  in  the  Appendix  (we  provide  equations  for  the  GP 
model,  with  minor  changes  manifested  for  the  AR  case).  Since  VB  analysis  is  not 
as  widely  used  in  the  statistics  literature,  for  completeness  we  provide  details  on  its 
modeling  assumptions. 

Let  ©  represent  a  vector  of  all  model  parameters;  the  goal  is  to  infer  the  posterior 
p(& \V).  The  likelihood  of  the  data  is  represented  p{T> |@)  and  the  prior  on  the 
model  parameters  is  denoted  p(@).  Let  g(©;T)  be  a  parametric  distribution  with 
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hyperparameters  T,  and  consider  the  variational  expression 

■nr)  =  J  r)inp(p|Q^e)  =  cK1[«(®;r)||p(0|®)]  -  inP(v)  (10) 

In  VB  analysis  the  goal  is  to  optimize  the  hyperparameters  T  to  minimize  the 
Kullback-Leibler  divergence  between  q(©]T)  and  the  true  posterior  p(©\T>);  this 
corresponds  to  adjusting  T  in  g(0;r)  such  that  J-’(T)  is  minimized.  Note  that 
f  d©q(© ;  r)lnp^|Q^0^  is  only  a  function  of  the  likelihood  p(V |0)  and  the  prior 
p(0),  and  not  the  unknown  posterior;  with  careful  selection  of  (/(©:  T),  numerical 
techniques  akin  to  expectation-maximization  (EM)  Beal  (2003)  can  be  employed  to 
minimize  J-’(T),  with  assurance  of  convergence  to  a  local-optimal  solution. 

Focusing  on  the  GP  temporal  model  (the  AR  case  is  very  similar),  the  model 
parameters  are 

©  =  {{AE.}j=i,...,d, ,  {Bk}k=i,...,K,  {Zk(sit)}t=i,...,T, ,  Co,  ci,  do,  d\}.  (11) 

k=l,...,K 

k=l,...,K 

where  Zk(sn )  ~  Bernoulli^ (sit)),  with  Pk(su)  defined  in  (4).  Completing  the 
generative  process,  vlt  ~  Y\q=i  Poisson(A|  )  if  Zk(slt)  =  0  for  k  <  k  and  Zk(stt)  =  1; 
At .  is  the  j  th  component  of  vector  At. 

fcjL  rvt 

In  VB  one  typically  assumes  a  factorized  form  for  g(0;T),  be.,  q{©'.  T)  = 
),  where  0;  represents  the  /th  set  of  model  parameters  and  qi(©i]Ti) 
is  a  parametric  density  function  with  hyperparameters  Ty  the  union  of  all  0/  cor¬ 
responds  to  0.  Through  careful  selection  of  qi(©p,  T;)  one  may  iteratively  optimize 
the  variational  expression  ^(0). 

For  the  proposed  model,  q(Bk)  is  a  multivariate  normal  distribution,  q(Zk(sit )) 
is  Bernoulli  (with  Bernoulli  probability  defined  by  a  logistic  function),  q(ipk)  is 
multinomial  based  upon  a  finite  library  of  possible  parameters  }z=i,L>  and  q(c0) 
and  q(do)  are  gamma  distributions.  It  is  not  possible  to  define  a  q{\*kr)  that  yields 
closed-form  updates.  Therefore,  the  parameters  X*kr  within  the  VB  analysis  are  also 
approximated  at  each  iteration  via  a  point  estimate  that  maximizes  the  functional 
J-'(r).  Similarly,  q{c\)  and  q(d\)  cannot  be  obtained  in  closed  form.  The  parameters 
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Ci  and  d\  are  updated  on  each  VB  iteration  by  defining  parameters  that  maximize 
the  functional  J-’(r). 

4  Example  Results 

While  the  proposed  model  may  appear  relatively  complicated,  the  number  of  hy¬ 
perparameters  that  need  be  set  is  actually  modest.  We  compare  the  AR-LSBP  and 
GP-LSBP  models  for  imposing  a  prior  on  the  temporal  dependence  with  a  simpler 
model  in  which  the  priors  for  each  time  point  t  are  independent.  In  the  context  of 
this  independent  LSBP  (ind-LSBP),  we  impose 

Pkjt  ~  A/"( 0,  a^t)  ,  akjt  ~  Gamma(a0,  b0 )  (12) 

and  we  set  ao  =  bo  =  10-6  as  in  the  relevance  vector  machine  (RVM)  Tipping  (2001). 
The  same  gamma  priors  are  placed  on  a.p  and  a\  for  the  AR-LSBP  model,  and  on 
c0  and  Ci  for  the  GP-LSBP  model.  In  all  examples  the  truncation  level  on  the  LSBP 
was  set  at  K  =  20,  and  the  results  are  insensitive  to  this  parameter,  as  long  as  it 
is  large  relative  to  the  actual  number  of  clusters/segments  inferred  by  the  model. 
Finally,  we  must  specify  the  library  for  kernel  parameters  {'ipk}k=i,K',  the  manner  in 
which  these  are  specified  is  discussed  when  presenting  the  specific  examples. 

For  uniform  temporal  sampling,  the  AR(1)  and  GP  imposition  of  temporal  dy¬ 
namics  are  theoretically  identical,  for  the  imposed  GP  covariance.  Nevertheless, 
even  for  uniform  temporal  sampling  we  show  results  for  both  of  these  implementa¬ 
tions,  because  the  details  of  the  numerics  dictates  that  the  two  models  are  slightly 
different  in  practice.  Specifically,  within  the  GP  model  a  point  estimate  is  employed 
for  the  kernel  hyperparameters,  with  this  obviously  unnecessary  for  the  direct  AR(1) 
model.  The  comparison  allows  examination  of  the  accuracy  of  this  approximation 
within  the  GP  inference,  relative  to  the  direct  AR(1)  implementation;  this  sheds 
light  on  the  quality  of  the  computations  for  non-uniform  temporal  sampling,  where 
the  GP  implementation  is  required. 
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Figure  2:  Simulation  example.  The  high-intensity  window  moves  gradually  from  [5,10]  to  [10, 
15]  when  time  increases. 


4.1  Simulation  Example 

We  assume  the  data  are  constructed  by  a  total  of  9  equally  spaced  time  instances,  t  = 
1,2,  At  each  time  we  randomly  draw  50  spatial  locations  in  one-dimensional 
space  from  a  uniform  distribution  with  support  [0,  20] ,  denoted  as  sit  ~  Uniform[0,  20] , 
i  —  1, 50,  t  =  1,  ...9.  For  each  location,  we  draw  an  event  count  v,;t  from  a  Poisson 
distribution  with  the  intensity  parameter  Xit.  To  represent  the  time  dynamics,  we 
let  A n  =  20  when  5  +  |(t  —  1)  <  s*t  <  10  +  |(f  —  1),  and  \t  =  1  otherwise.  By  this 
setting  the  high-intensity  window  moves  gradually  from  [5,10]  to  [10,  15]  when  time 
t  increases.  Note  that  here  slt  e  M1  and  vlt  e  M1.  The  kernel  centers  are  defined 
as  §j  =  0.5 (j  —  1)  for  j  =  1, ...,  J.  The  data  are  depicted  in  Figure  2.  Within  the 
analysis,  the  library  of  kernel  parameters  are  the  union  of  the  following  two  sets: 
{0.05,  0.1, 0.05, . . . ,  0.5}  and  {0.5, 1, 1.5, . . . ,  5}. 

The  mean  results  from  VB  are  shown  in  Figure  3,  in  which  the  inferred  Poisson 
rate  is  constituted;  for  these  and  all  VB  results  the  computations  were  stopped  when 
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the  change  in  the  variational  bound  changed  by  10-4.  Further,  all  VB  results  are 
initialized  at  random.  The  VB  results  presented  below  represent  a  local-optimal 
solution,  which  forms  one  source  of  error,  and  this  is  compounded  by  the  factor¬ 
ized  approximation  to  the  posterior.  Nevertheless,  the  VB  implementation  of  the 
GP-LSBP  and  AR-LSBP  model  yields  results  comparable  to  that  of  the  MCMC 
implementation.  When  implementing  MCMC,  a  total  of  10,000  iterations  are  run, 
with  the  first  1000  discarded  as  burn-in.  On  the  same  PC  (and  both  codes  written 
in  Matlab),  the  VB  GP-LSBP  and  AR-LSBP  results  required  approximately  158 
seconds  of  CPU  time,  while  the  VB  ind-LSBP  results  required  approximately  96 
seconds.  In  contrast,  the  GP-LSBP  and  AR-LSBP  results  based  on  the  MCMC 
sampler  required  6517  seconds,  and  ind-LSBP  required  2913  seconds  (109  and  48 
minutes,  respectively).  The  software  was  not  optimized,  and  these  numbers  there¬ 
fore  represent  a  relative  view  of  computational  expense  of  the  VB  and  MCMC 
solutions. 

From  Figure  3  it  is  observed  that,  for  the  VB  solution,  incorporation  of  temporal 
smoothness  in  the  GP-LSBP  model  yields  significant  improvements  in  the  inferred 
Poisson  rate,  as  compared  to  the  VB  ind-LSBP  solution  (with  temporal  dependence 
not  accounted  for  in  the  prior);  the  AR-LSBP  model  performed  similar  to  GP-LSBP. 
It  appears  that  the  prior  constraint  imposed  by  GP / AR  within  the  VB  solution  plays 
an  important  role  in  mitigating  the  underlying  VB  approximations.  By  contrast, 
for  the  MCMC  results  improvements  are  manifested  via  GP-LSBP  and  AR-LSBP 
relative  to  ind-LSBP,  but  in  this  case  the  differences  are  less  dramatic  (plots  of 
MCMC  results  are  not  shown,  for  brevity). 

We  next  examine  the  generative  performance  of  the  proposed  model.  After  the 
model  has  been  learned,  either  via  VB  or  MCMC,  we  randomly  generate  100  new 
test  data,  following  the  same  procedure  that  generated  the  training  data.  We  then 
compute  the  average  log-likelihood  and  the  accuracy  rate  of  segmentation  from  the 
learned  GP-LSBP,  AR-LSBP  and  ind-LSBP  models.  The  accuracy  rate  of  segmen¬ 
tation  is  defined  as  the  number  of  test  data  points  segmented  correctly  as  a  fraction 
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(a)  GP-LSBP  inferred  based  on  VB 


t=1 
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Spatial  location 

(b)  ind-LSBP  inferred  based  on  VB 


Figure  3:  Segmentation  and  latent  intensity  inferred  by  VB:  Comparison  between  GP-LSBP 
and  ind-LSBP,  considering  the  simulated-data  example.  The  AR-LSBP  results  are  similar  to  the 
GP-LSBP  results,  and  are  omitted  for  brevity. 
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of  total  number  of  test  data  points.  The  results  are  summarized  in  Table  1.  We 
find  that  the  GP-LSBP  and  AR-LSBP  models  achieve  a  higher  likelihood  and  accu¬ 
racy  of  segmentation  compared  to  the  ind-LSBP.  Note  that  the  differences  between 
GP-LSBP,  AR-LSBP  and  ind-LSBP  are  relatively  modest  for  the  MCMC  solution, 
while  there  are  again  marked  advantages  in  the  GP-LSBP  and  AR-LSBP  solutions 
relative  to  ind-LSBP  when  employing  VB  inference. 

Table  1:  Comparison  of  generative  performance  between  AR-LSBP,  GP-LSBP  and  ind-LSBP, 
on  simulated  data. 


Method 

Average  log-likelihood 

Accuracy  rate  of  segmentation 

VB 

MCMC 

VB 

MCMC 

AR-LSBP 

-3.702 

-1.749 

0.9796 

0.9801 

GP-LSBP 

-3.882 

-2.082 

0.9765 

0.9757 

ind-LSBP 

-15.544 

-2.274 

0.9478 

0.9741 

Table  2:  Comparison  of  prediction  performance  between  AR-LSBP,  GP-LSBP  and  ind-LSBP. 


Nmiss 

Average  log-likelihood 

Accuracy  rate  of  segmentation 

AR-LSBP 

GP-LSBP 

ind-LSBP 

AR-LSBP 

GP-LSBP 

ind-LSBP 

VB 

MCMC 

VB 

MCMC 

VB 

MCMC 

VB 

MCMC 

VB 

MCMC 

VB 

MCMC 

1 

-3.948 

-1.975 

-4.102 

-2.123 

-21.194 

-2.641 

0.9792 

0.9794 

0.9767 

0.9758 

0.7165 

0.9545 

2 

-4.211 

-2.241 

-4.526 

-2.473 

-27.195 

-3.077 

0.9787 

0.9786 

0.9761 

0.9754 

0.6669 

0.9581 

3 

-4.468 

-2.573 

-4.718 

-2.652 

-27.776 

-3.507 

0.9787 

0.9785 

0.9763 

0.9752 

0.6458 

0.9379 

4 

-4.882 

-2.740 

-5.133 

-3.108 

-26.682 

-3.963 

0.9780 

0.9783 

0.9752 

0.9740 

0.6647 

0.9274 

5 

-5.801 

-3.014 

-5.987 

-3.521 

-31.217 

-4.316 

0.9763 

0.9770 

0.9741 

0.9633 

0.6131 

0.9066 

Finally  we  test  the  prediction  performance  of  the  model.  We  first  generate  data 
V  =  {si,vu}i= i,...,5o,  t= i,...,9  as  discussed  above,  and  then  randomly  select  Nmiss  time 
instances  ti, ...,  tNmiaa  from  t  =  1,...,9,  and  this  constructs  our  test  data  T)tst ;  the 
training  data  Vtrn  is  composed  of  the  data  in  V  but  not  in  Vtst.  We  learn  the 
model  based  on  VB  or  MCMC  analysis  with  T>trn,  and  predict  the  kernel  weights 
/ 3kjf  and  Poisson  intensities  AC  at  time  t.  The  average  log-likelihood  and  accuracy 
of  segmentation  are  evaluated  based  on  the  prediction  results  of  T)tst ,  given  only  the 
spatial  locations  %.  We  perform  100  trials,  and  at  each  trial  Nmiss  time  instances 
are  selected  randomly  to  construct  Vtst .  The  average  results  are  shown  in  Table  2. 


15 


Only  the  GP-LSBP  results  are  fully  principled  in  this  analysis,  where  we  use  the 
learned  parameters  of  the  GP  covariance  matrix  to  interpolate  to  new  time  points 
Rasmussen  and  Wiliams  (2006).  The  AR  model  implicitly  assumes  that  the  data 
are  sampled  uniformly  in  time,  while  the  ind-LSBP  has  no  principled  means  of  in¬ 
terpolating  to  missing  time  points.  Nevertheless,  as  a  comparison,  for  the  AR-LSBP 
computations  in  this  test  the  AR  component  was  simply  applied  to  consecutive  ob¬ 
served  time  points,  essentially  assuming  that  the  temporal  variation  was  smooth, 
even  if  not  sampled  uniformly.  To  interpolate  to  new  points  using  the  learned  AR- 
LSBP  and  ind-LSBP  results,  to  obtain  model  parameters  at  any  new  point  t,  we 
average  the  learned  model  parameters  from  the  two  closest  observed  points,  before 
and  after  t.  From  Table  2  it  is  observed  that  again  for  the  VB  solution,  there  is 
a  marked  advantage  manifested  via  the  GP-LSBP  and  AR-LSBP  priors,  as  com¬ 
pared  to  ind-LSBP.  For  the  MCMC  solution,  there  is  also  a  noticeable  advantage 
manifested  via  the  GP-LSBP  and  AR-LSBP  solutions,  particularly  for  segmenta¬ 
tion  accuracy  for  relatively  large  Nmiss.  Based  upon  the  average  log-likelihood,  we 
note  a  small  but  consistent  advantage  of  the  AR-LSBP  model  over  the  GP-LSBP 
counterpart,  for  both  VB  and  MCMC  computations.  This  observation  on  simulated 
data  will  carry  over  to  the  analysis  of  real  data. 

4.2  Crime  Data 

We  investigate  crime  events  in  Cincinnati,  OH,  USA;  the  data  are  available  online 
at  http:  //www.  cincinnati-oh.gov.  The  data  include  the  date,  time,  location  and 
other  information  of  all  reported  crimes  in  Cincinnati  since  2006.  This  data  set  was 
first  studied  in  Taddy  (2008,  2010),  where  a  mixture  of  beta  distributions  was  em¬ 
ployed  to  model  the  event  density  i'(s),  and  to  discover  the  evolution  of  the  density 
with  time.  In  our  problem  we  seek  to  segment  the  city  into  contiguous  regions,  with 
crime  events  at  each  region  characterized  by  a  common  constant  Poisson  intensity 
vector. 
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We  consider  117,314  crime  events  within  the  city,  reported  from  January  2006  to 
December  2008.  Each  crime  is  assigned  a  uniform  crime  reporting  (UCR)  code.  In 
total  more  than  170  different  UCR  codes  describe  a  variety  of  crimes.  These  crime 
events  can  be  categorized  into  17  different  crime  types,  based  on  the  prefix  of  their 
UCR  codes.  They  are:  1)  murder,  2)  rape,  3)  robbery,  4)  assault  with  weapon, 
5)  burglary,  6)  nonvehicle  theft,  7)  vehicle  theft,  8)  general  assault,  9)  arson,  10) 
forgery,  11)  fraud,  12)  receiving  stolen  property  13)  vandalism,  14)  weapons  related 
but  no  physical  harm,  15)  sexual  crime,  16)  children  related,  17)  general  harassment. 
As  an  example,  the  locations  (latitude  and  longitude  coordinates)  of  the  3090  crime 
events  in  January  2008  are  shown  in  Figure  1(a).  Based  on  the  locations  of  all  the 
117,314  crime  events,  the  observation  window  is  considered  within  a  rectangular 
region  of  [39.06°,  39.24°]  latitude  and  [-84.70°,  -84.35°]  longitude. 

We  construct  the  data  V  =  {s*,  t=i,...,T  as  follows.  The  total  crime 

events  within  one  month  are  considered  as  one  time  instance,  and  therefore  there  are 
in  total  36  time  points.  At  each  time,  the  observation  window  is  divided  into  15,750 
small  square  grids  (90  rows  by  175  columns)  of  size  0.002°  x  0.002°,  and  the  event 
location  sit  is  defined  as  the  center  of  each  small  square  area,  with  this  denoted  as 
A(sj).  The  count  vljt  is  then  the  number  of  Type  j  crimes  within  A(st)  over  the 
corresponding  month  indexed  by  t.  This  produces  a  17-dimensional  count  vector  vlt 
at  Si  for  i  —  1, ...,  15750  and  t  =  1,  ...,36.  Related  research  in  Taddy  (2008,  2010) 
applied  marked  Poisson  processes  to  address  the  crime  types,  regarding  each  crime 
type  at  sit  as  a  random  mark.  Here  we  attempt  to  segment  the  city  by  considering 
all  the  crime  types  within  a  local  region  A(s,t)  as  a  correlated  variable  (a  vector), 
instead  of  treating  each  event  as  a  random  type. 

The  proposed  GP-LSBP,  AR-LSBP  and  ind-LSBP  models  are  inferred  via  VB 
and  MCMC,  with  truncation  level  K  =  20.  The  kernel  centers  are  uniformly  spaced 
every  0.04°  (latitude  and  longitude)  in  the  observation  window,  with  a  total  of  60  ker¬ 
nel  centers  defined.  The  library  of  kernel  parameters  {?/>* }i=i,l  are  the  union  of  the 
following  sets:  {0.006°,  0.012°,  0.018°, . . . ,  0.06°}  and  (0.06°,  0.12°,  0.18°, . . . ,  0.6°}. 
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On  the  same  PC,  the  VB  GP-LSBP  and  AR-LSBP  results  require  approximately 
2.8  hours  of  CPU  time,  while  the  VB  ind-LSBP  results  required  approximately  1.3 
hours.  By  contrast,  due  to  the  large  size  of  the  data,  3000  MCMC  sample  are  em¬ 
ployed,  with  1000  discarded  as  burn-in.  With  the  same  PC,  the  MCMC  GP-LSBP 
and  AR-LSBP  results  required  approximately  47.5  hours.  We  also  considered  10,000 
MCMC  samples,  with  1000  discarded  as  burn-in  (at  very  significant  computational 
cost),  with  little  change  in  the  results  relative  to  those  presented  below. 

Figure  4(a)  shows  the  VB-based  segmentation  of  the  entire  spatial  observation 
window  at  36  time  instances,  using  GP-LSBP  (similar  results  were  found  using  AR- 
LSBP,  omitted  for  brevity).  The  city  is  segmented  into  4  regions  (inferred  by  the 
model),  and  the  segmentation  changes  smoothly  with  time.  For  comparison,  Figure 
4(b)  shows  the  segmentation  results  obtained  by  applying  an  independent  LSBP 
(VB  computations)  at  each  time  instance.  It  is  observed  that  with  GP  priors  the 
proposed  model  presents  a  spatial  segmentation  more  consistently  over  time  and 
spatially  more  contiguously  than  ind-LSBP. 

We  are  also  interested  in  examining  the  clustering  manifested  by  the  MCMC 
computations,  with  this  complicated  by  label  switching  between  samples.  We  com¬ 
pute  an  MCMC  clustering  that  may  be  compared  to  the  VB  results  as  follows.  We 
consider  one  spatial  location  from  Segment  1  in  Figure  4,  denoted  s).  Based  upon 
the  MCMC  collection  samples,  for  each  other  spatial  location  in  the  scene 
we  compute  the  probability  that  position  s  and  s)  are  in  the  same  cluster.  All 
positions  s  with  high  probability  of  such  clustering  should  (ideally)  constitute  a 
spatial  region  similar  to  Segment  1  inferred  via  VB.  In  Figure  5(a)  we  show  MCMC 
results  for  Segment  1,  and  the  high- probability  regions  (red)  do  indeed  align  well 
with  the  VB  results  in  Figure  4.  In  Figure  5(b)  we  compute  similar  MCMC  results 
for  Segment  2,  and  in  this  case  the  high-probability  spatial  locations  are  aligned  well 
with  the  VB  results  for  Segment  2  in  Figure  4.  We  found  in  general  good  agreement 
between  the  VB  and  MCMC  segmentation  results  for  GP-LSBP  and  AR-LSBP  for 
these  data. 
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(b) 

Figure  4:  Comparison  of  spatial  segmentation  for  crime  data  in  Cincinnati,  OH  from  January 
2006  to  December  2008  (VB  results).  Each  color  represents  a  segment  with  an  associated  intensity 
vector  A£t,  and  there  are  totally  four  segments  inferred:  1  -  dark  blue,  2  -  light  blue,  3  -  yellow, 
and  4  -  dark  red.  (a)  GP-LSBP,  (b)  ind-LSBP 
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(b) 


Figure  5:  Comparison  of  spatial  segmentation  for  crime  data  in  Cincinnati,  OH  from  January 
2006  to  December  2008  (MCMC  results),  (a)  Segment  1,  (b)  Segment  2,  where  these  segments  are 
related  to  the  results  in  Figure  4(a). 
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Figures  6(a)-(d)  show  the  dynamic  change  of  the  VB-inferred  Poisson  intensities 
for  each  segment.  To  make  the  figure  easier  to  read,  we  only  plot  components  3,  5  and 
6  from  the  17- dimensional  vector  A *kt\  these  components  correspond  to  crime  types 
“robbery”,  “burglary”,  and  “nonvehicle  theft”,  respectively.  From  these  figures 
we  observed  that  in  all  segments  the  crime  intensities  fluctuated  periodically  over 
season.  Generally  in  summer  there  were  more  crime  events  of  all  types  than  than 
in  winter.  The  overall  crime  intensities  varied  with  regions.  Segments  4  was  in  the 
downtown  region,  and  had  much  more  crime  events  compared  to  other  regions.  In  all 
four  regions  Type  6  crime  (nonvehicle  theft)  was  dominant.  In  addition,  the  crime 
patterns  were  different  in  different  regions.  For  example,  Segment  4  had  relatively 
less  Type  5  crime  (burglary),  while  in  other  3  segments,  the  intensity  of  Type  5 
crime  was  almost  half  of  Type  6  crime.  In  Segments  4,  Type  3  crime  (robbery)  was 
prevalent,  while  Segment  1  had  relatively  less  Type  3  crime.  For  a  comparison,  we 
also  present  the  MCMC-inferred  Poisson  intensities  of  Segment  3,  as  a  representative 
(typical)  example.  It  is  observed  that  the  MCMC  and  VB  results  are  in  generally 
good  agreement,  for  the  GP-LSBP  and  AR-LSBP  models. 

These  results  may  be  used  by  police  to  assign  resources  (personnel)  to  segmented 
regions  in  a  consistent  manner,  to  address  varying  levels  of  crimes.  The  segments 
typically  change  with  season,  and  the  spatial  distribution  of  resources  may  be  tem¬ 
porally  adjusted  as  well.  By  relating  the  demographics  of  regions  to  the  spatial 
segments  (we  didn’t  have  access  to  such  demographics),  one  may  deduce  relation¬ 
ships  between  types  of  crimes  and  the  types  of  people  living  and  working  in  given 
regions,  of  interest  to  criminologists  and  city  planners. 

Following  the  same  procedure  as  in  the  simulated  example,  we  now  examine 
the  prediction  performance  of  our  model  for  the  crime  data.  We  randomly  select 
Nmiss  time  instances  to  construct  a  test  set,  and  let  the  remaining  data  be  the 
training  set.  Ten  random  trials  are  performed  and  the  comparison  of  average  log- 
likelihood  between  GP-LSBP,  AR-LSBP  and  ind-LSBP  inferred  by  VB  is  shown 
in  Table  3.  Since  in  this  real  application  there  is  no  ground  truth,  we  cannot 
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Intensity 


(a)  Segment  1:  Dark  blue  region  in  Fig.  4(a) 


(b)  Segment  2:  Light  blue  region 


(c)  Segment  3:  Yellow  region 


(d)  Segment  4:  Dark  red  region 
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(e)  Segment  3  inferred  by  MCMC 

Figure  6:  Inferred  intensity  vector  \^t  associated  with  the  segments  shown  in  Figure  4(a). 
Only  3  crime  types  are  shown  here  to  make  the  figure  easy  to  read. 
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evaluate  the  accuracy  rate  of  segmentation  as  done  in  the  simulated  example.  From 
Table  3  GP-LSBP  and  AR-LSBP  consistently  achieve  higher  likelihood  than  the 
independent  LSBP  for  various  Nmiss  values.  Note  also  that  for  these  real  data  there 
is  less  of  a  difference  between  the  AR/GP-LSBP  and  ind-LSBP  results  for  the  VB 
solution,  as  compared  to  the  synthetic  data  considered  above.  We  do  not  perform 
this  experiment  for  MCMC  inference,  as  the  computational  requirements  needed  to 
perform  these  many  experiments  are  prohibitive  with  this  large  data  set  (however, 
in  isolated  tests,  the  results  were  slightly  better  than  the  VB-based  GP-LSBP  and 
AR-LSBP  models,  consistent  with  the  simulated  example  above). 


Table  3:  Comparison  of  average  log-likelihood  in  the  prediction  for  the  crime  data  (VB  infer¬ 
ence)  . 


N  ■ 

1  v  miss 

1 

2 

3 

4 

5 

6 

AR-LSBP 

-6.131 

-6.352 

-7.204 

-7.631 

-7.957 

-8.338 

GP-LSBP 

-6.570 

-6.762 

-7.713 

-7.965 

-8.426 

-8.721 

ind-LSBP 

-8.666 

-9.247 

-9.595 

-8.840 

-9.848 

-8.762 

4.3  Pearson  residuals 

Following  Taddy  (2010),  we  check  model  quality  via  computation  of  Pearson  resid¬ 
uals  (see  Baddeley  et  al.  (2005)  for  a  detailed  discussion  of  residuals  for  spatial 
point  processes).  For  the  modeling  framework  considered  here,  the  Pearson  residual 


reduces  to 


(13) 


where  nlt  is  the  number  of  events  in  region  A (sit)  and  \lt  is  the  inferred  Poisson 
rate  parameter  in  small  region  A(su).  Ideally  the  residual  should  be  close  to  zero, 
if  the  underlying  Poisson  assumption  is  valid.  Note  that  within  the  proposed  model 
we  have  a  vector  of  counts  vit,  and  therefore  we  may  compute  the  residual  for  each 
of  the  different  types  of  crimes. 
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Figure  7 :  Pearson  residuals  for  “nonvehicle  theft,”  using  VB  inference;  best  viewed  electrically, 
zoomed  in.  (a)  ind-LSBP,  (b)  GP-LSBP,  (c)  AR-LSBP. 


From  Figure  7,  which  is  based  upon  VB  inference,  we  observe  that  the  Pearson 
residuals  tend  to  decrease  substantially  based  upon  a  model  that  explicitly  imposes 
temporal  smoothness  (note  that  the  residuals  are  significantly  lower  for  GP-LSBP 
and  AR-LSBP,  relative  to  ind-LSBP).  Further,  the  AR-LSBP  residuals  are  smaller 
than  those  of  the  GP-LSBP.  Although  we  omit  the  MCMC  results  for  brevity,  sim¬ 
ilar  phenomena  was  observed  in  that  case.  The  residuals  tend  to  be  small,  in  the 
range  [-2,2],  with  the  larger  values  manifested  on  the  edges  of  segments,  as  might 
be  expected  (segment  interfaces  are  characterized  typically  by  abrupt  changes  in 
statistical  properties) . 

5  Conclusions 

A  Bayesian  hierarchical  model  has  been  presented  for  segmenting  time-evolving 
point  process  data,  when  the  events  are  in  vector  form.  The  spatial-dependent 
point  process  is  modeled  using  a  generalization  of  a  Poisson  process,  with  piecewise 
constant  Poisson  intensities  defined  within  the  observation  window.  The  logistic 
stick-breaking  process  is  employed  to  favor  spatially  contiguous  segments,  and  GP 
and  AR  models  are  considered  for  imposition  of  temporal  smoothness  of  the  seg¬ 
mentation  and  the  Poisson  intensity. 

In  addition  to  developing  the  model,  a  contribution  of  this  paper  concerns  a 
detailed  comparison  between  MCMC  sampling  and  a  VB  approximation.  For  both 
the  synthetic  and  real  data,  it  was  found  that  the  GP-LSBP  and  AR-LSBP  re¬ 
sults  computed  via  VB  and  MCMC  were  in  close  agreement,  and  the  imposition 
of  temporal  smoothness  manifested  via  GP/AR  (compared  to  treating  the  differ¬ 
ent  temporal  samples  independently)  yielded  significant  improvements  in  the  VB 
results.  While  the  VB  results  are  approximate,  and  are  subject  to  local-optimal 
solutions  (although  the  GP/AR  models  seemed  to  mitigate  this  to  some  extent), 
the  VB  approach  provides  significant  advantages  with  regard  to  computations.  For 
the  large  crime  data  set  considered,  while  the  MCMC  results  are  in  principle  con- 
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vergent,  if  run  for  enough  samples,  this  attractiveness  is  mitigated  by  the  very 
significant  computation  time  required  to  realize  a  number  of  collection  samples  to 
assure  that  we  are  indeed  sampling  from  the  posterior.  Given  that  computational 
requirements  will  in  practice  mitigate  the  ability  to  collect  as  many  MCMC  sam¬ 
ples  as  desired  (and  therefore  MCMC  is  also  an  approximation),  the  VB  solution 
appears  to  be  an  attractive  option.  However,  the  results  presented  here  indicate 
that  imposition  of  as  much  information  as  possible  (here  smoothness  via  GP/AR) 
is  desirable.  In  future  research  it  is  of  interest  to  consider  online  VB  analysis  Hoff¬ 
man  et  al.  (2010),  which  provides  further  acceleration  for  large  datasets,  and  it  is 
appropriate  for  time-dependent  data  observed  in  an  online/sequential  manner,  like 
the  time-evolving  crime  data  considered  here. 
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Appendix:  MCMC  and  VB  Update  Equations 


5.1  MCMC  Inference 


The  MCMC  computations  are  performed  using  Gibbs  sampling  where  the  condi¬ 
tional  density  functions  are  analytic,  and  samples  are  drawn  from  the  conditional 
density  functions  via  Metropolis-Hastings  when  not  analytic.  The  update  equations 
are  summarized  as  follows. 


•  Sample  X*k].  from  their  respective  posteriors  conditional  on  {Zk  (sit)}  and 


T  M 

P  (Ay  -)  OC  JJ  JJ  Poisson  (uijt\ \*kjt)I{ci~k)  In  A f  (A^:|  0,Tkj)  .  (14) 

t= 1  i= 1 

It  is  not  possible  to  sample  A£-:  from  the  full  conditions.  We  update  each 
A  Ij.  by  the  Metropolis-Hastings  algorithm.  When  updating  A  £  ■.,  the  proposed 
A*$T+1)  is  generated  from  the  following  distribution 


1  (in  W+1’  I  In  W)  =  U  (in  A$\  (4  +  4)It)  . 

The  acceptance  probability  for  the  proposed  A^r+I)  is  min  ^  1 ,  a  ^A(’,(.r+I  ) 
where 
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•  Sample  /3/.:!  from  their  respective  posteriors  conditional  on  { Zk  (sit)} 

T  M  J 

p(Bk I  -)  oc  Ml  Bk>  0,Sfci).  (17) 

£=1  2=1  j  =  1 
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Reorder  the  entries  of  B k  (and  the  associated  flk)  in  (8)  such  that  Bk  = 
[f3k-.ii ' ' '  i /3k:T]T i  then  we  obtain 


T  M 


p(Bk I -)  oc  exp  -EE  /  ( Vkit )  PL<Pkit<pLtPk:t 

{  t= i  i= i 

f  1  T  M  ) 

•  exp  --Bfc  ^  ^  (2Efc  (sit)  -  1)  <pllt(3k,t  >  (18) 

l  t=i  i= i  J 


So,  Bfc  can  be  draw  from  a  normal  distribution  as 

p (Bk\  -)  =  AT  (Bk,  (fi* 1  +  Uk)-1  Yk,  (W1  +  Uk) 


-1 


(19) 


where  Uk  is  a  ( J  +  1)  Tx  ( J  +  1)  T  block-diagonal  matrix  with  the  t-th  ( J  +  1)  x 


M 


( J  +  1)  block  expressed  as  ukt  =  2  J]  /  (//A.;,)  < fanout  and  Tfcisa(J+l)Txl 

2=1 


vector  formed  by  concatenating  the  T  vectors  ykt  =  XI  (^fc  (s«t)  —  §)  <t>kit,  t  = 

2=1 

1,  •  •  •  ,  T.  In  these  expressions  4>klt  =  [1,  K  (sit,  sp  tpk) ,  •  •  •  ,  K  (sit,  sj ;  ^fc)]T  . 
The  parameter  /  (rjkit)  =  p>litf3k:t. 


•  Sample  Z&  (s#)  from  their  respective  posteriors  conditional  on  Bk  and  { uljt } . 
According  to  the  definition  of  LSBP, 


p(Zk  (. sit )  =  1|  -) 


,f  (  }  =  0!oll<k 

^(Plfe(«<t))p(vit|AJt)+<7(-pfe(ait))p(^t|A*/t)  V  7  ^20) 


a  (gk  (sit)) ,  if  3  l  <  k,  such  that  Zr  (sit)  =  1 

where  k'  is  the  first  integer  larger  than  k,  associated  with  non-zero  indicator. 
The  equation  can  be  expressed  as 
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•  With  a  uniform  prior  assumed  on  the  kernel  parameter  library  (a  predefined 
finite  set),  the  posterior  distribution  for  each  ?/y  can  be  represented  as 

T  M  T  M 

p^k = V’D  °c  mi  if*  ~  (23) 

t= 1  2=1  t=  1  2=1  /C7>/C 

For  each  specific  k  from  k  =  1, K,  we  have  the  following  update  equation 

p(ibk  =  i/’*) 

^  ^ rk  >  rfc  ~  Mult  (Pfcl  ,-,PkL),Pkj=  •  (24) 

We  sample  the  kernel  parameters  based  on  the  multinomial  distributions  from 
a  given  discrete  set  in  each  MCMC  iteration. 

•  Sample  c0  from  its  posteriors  conditional  on  {Bk}  and  {a0,  frol¬ 

ic 

p(c0)  oc  Gamma  (c0;  a0,  fro)  nV(Bt;0,SJt).  (25) 

k=  1 

Therefore,  Co  can  be  drawn  from  a  Gamma  distribution 

p(c0)  =  Gamma  ^c0;  a0,  fro  j  ,  (26) 

K  J 

where  a0  =  a0  +  0.5KT(J  +  1)  and  fr0  =  fro  +  0.5  Pkj-^kjPkj-.  with 

k= 1 j= o 


•  Sample  C\  from  its  posterior  conditional  on  {Bk} 

K 

p(ci)  OC  V(o,i)(ci;0, 1)  nV(Bt;0,nt).  (27) 

k= 1 

When  updating  ci,  the  proposed  c^r+1^  is  generated  from  the  following  distri¬ 
bution 


q  (4T+1)|c[)  =  V(0)i)  (4t+1);  c[,  l)  .  (28) 

The  acceptance  probability  for  the  proposed  c^T+1')  is  min  ^1,  ck(c^t+1\  c[)^ , 
where 
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•  Similarly,  d0  can  be  drawn  from  a  Gamma  distribution 


p(do)  =  Gamma  (do!  do,  do  )  , 


(30) 


K  d 


where  a0  =  ao  +  0.5dKT  and  60  =  &o  +  0.5  lnA]y.rfe  -lnA]y.  with  [r^-]^  = 

A— 1  j=l 


•  Similar  with  c\,  we  update  d\  by  the  Metropolis-Hastings  algorithm.  The 
proposed  d{^+]'  is  generated  from  the  following  distribution 

q  (4r+1)|d[)  =  Ad(0,i)  (4r+1);  d[,  l)  •  (31) 

The  acceptance  probability  for  the  proposed  d[T+1')  is  min  ^1,  a(d[T+1\  d[)^ , 
where 


/  t(t+1)  7 T\  '  k 

a(d\  \d1)  =  -— 


-  i  /  \  |  dK 

hiM^exp  / 1  ( d( -«>2  _  do+o2 


■exP  (dEE  K)lnAW:  -J2T,  ln  (4T+1>)lnAti,  )  )  .(32) 

l  Z  U=i  i=!  k= i  i=!  /  J 


5.2  VB  inference 


The  log-normal  priors  placed  on  the  Poisson  intensities  introduce  non-conjugacy, 
which  results  in  difficulty  for  VB  inference.  Therefore,  we  employ  a  point  estimate 
for  the  Poisson  intensities,  by  maximizing  the  lower  bound  T .  For  the  GP  hy¬ 
perparameters  ci  and  oil,  the  truncated  normal  prior  also  introduce  non-conjugacy. 
Their  posteriors  are  also  inferred  from  point  estimation  by  maximizing  the  VB  lower 
bound.  The  update  equations  of  the  posterior  inference  of  ©  are  summarized  below. 
In  our  model, 

©  =  {{^L-}j=l,...,d,  ,  {Bk}k=l,...,Ki  {Zk(Si:t)}t=l,...,T,  ,  Co,  Cl,  do,  di }. 
k=l,...,K 

k=l 

•  The  lower  bound  for  the  Poisson  intensity  A  £  •.  may  be  derived  as 

F(Kr)  oc  ~AljT^Akj  -  QTkje^  +  Akj  +  constant  (33) 


33 


where  Akj  =  log  (A  *kj.),  Rkj  =  [Ylii\(wk(sn))uiji  ~  1.  •  •  •  ,  J2iii(wk(siT))^ijT  ~ 
1]T,  and  Qkj  =  [J2ii\(wk(sn)),  •  •  •  ,  J2iii(wk(siT))]T ,  with  (•}  denoting  the 
expectation  such  that  (wk(slt))  =  q(wk(sit )  =  1)  (see  Section  2  for  detail  of 
wk(sit ))•  The  point  estimate  for  A*k-.  can  be  updated  at  each  VB  iteration  by 
maximizing  the  lower  bound  J-{Xk]).  One  may  easily  examine  that  J-{Xk]) 
is  a  concave  function,  and  therefore  a  global  maximum  can  be  obtained  by 
any  appropriate  convex  optimization  method.  Note  that  if  Tkj  — >  0  (setting 
large  variance  for  the  prior  distribution),  by  taking  the  derivative  of  (33)  and 
setting  it  to  zero,  we  have  X*kr  =  eAkj  — »  Rkj/Qkji  which  is  consistent  with  the 
update  equation  if  independent  gamma  priors  are  placed  on  A *k-t  for  t  =  1, T. 
Therefore,  the  GP  priors  represented  in  Tkj  introduce  the  correlation  among 
the  components  of  A £■.. 

•  To  update  the  variational  distribution  for  the  kernel  weights  fikjt ,  note  that  the 
logistic  link  function  cr(-)  is  not  within  the  exponential  family  and  therefore 
introduces  the  nonconjugacy.  We  here  follow  Jaakkola  and  Jordan  (1998)  by 
introducing  a  variational  bound  using  the  inequality 

a(y)z[l  -  u(y)]1_z  =  <j{x)  >  a{rj)  exp(^^  -  f(y)(x2  -  rj2)) 

where  x  =  (2 z  —  1  )y,  f(r /)  =  ;  anq  ^  is  a  variational  parameter.  An 

exact  bound  is  achieved  as  rj  —  ±x. 

If  we  reorder  the  entries  of  Bk  (and  the  associated  f lk)  in  (8)  such  that  Bk  = 
[/3k: i, ...,  /3/.::r]T,  the  update  equation  for  Bk  can  be  expressed  as 

q(Bk)  =  AT  ((O^1  +  Uk)~lYk,  (n~kl  +  Uk)-1)  (34) 

where  Uk  is  a  (J  +  1)T  x  (J  +  1)T  block-diagonal  matrix  with  the  fth  ( J  + 
1)  x  ( J  +  1)  block  expressed  as 

M 

Hkt  2  ^  )  / {rfrzit^fpkit&kit 

2=1 


34 


and  Yk  is  a  (J  +  1)T  x  1  vector  formed  by  concatenating  the  T  vectors 

(Zk(sit))  — 

In  above  expressions  (pkit  =  [1,  sp  ipk),  •••,  Sj;  ipk)]T . 

The  variational  parameters  r]kit  are  then  updated  as 

via  =  4>ut(Pl.tPk-.t)4>kit  (35) 

where  (0^:t(3k:t)  =  COV(f3k:t,  f3k:t)  +  (f3k:t)  (f3k:t)T  and  it  may  be  evaluated 
from  q(Bk )  with  the  mean  and  variance  associated  with  time  t. 


2  )  t  =  l,...,T. 


•  The  variational  distribution  for  the  binary  indicator  Zk(su )  may  be  updated 
as 


q(Zk(sit)  =  1) 


1 

1  +  exp  {-pkit) 


(36) 


with 


Pkit  Ih  i  ■  (Zi(sit)))iogp(uit\xit)  -  x;  {Zk'^Sit))  (1  (^z(^it)))  logp(z/j£ \^k't) 

l<k  k'>k  l<k' 

l^k 

J 

+  +  (AcOi) 

3= 1 

where  \ogp(yit\\*ki)  is  the  data  log- likelihood  from  the  Poisson  distribution 
such  that  logp(njt|A£t)  =  log  Poisson(r'ijt|A^t)^ ,  and  the  expectation 

(Pkjt)  can  be  obtained  from  q(Bk). 


•  Due  to  the  non-conjugacy  of  the  sigmoid  function,  we  cannot  acquire  a  vari¬ 
ational  distribution  for  ipk.  However,  we  can  sample  it  from  its  posterior  dis¬ 
tribution  by  establishing  a  discrete  set  of  potential  kernel  widths  {ipp}i= 

The  posterior  distribution  for  each  ip k  is  represented  as 

T  M 

p{ipk  =  ip*)  oc  exp{J^  («;*(*#))  (log  <7  (Pfc(sit)))} 

t=  1  2=1 
T  M 

'exp£EE^'(%^(log  (!  -  °  (0fc(s«)))>}>  (37) 

*  1  i= 1  k’>k 
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where  glk(sit )  =  ^2j=1  Pkjtfc(siti  Sj'i'tPi)  +  /3kot .  The  detailed  calculations  of 
(logo-  ( glk(sit )))  and  (log  (l  —  a  (^[(sit))))  can  be  found  in  Ren  et  al.  (2011). 

•  The  variational  distribution  for  Co  may  be  updated  as. 

q(co )  =  Gamma  ^co;  ao,  bo'j  ,  (38) 

with  d0  =  a0  +  0.5AT(J  +  1)  and  b0  =  b0  +  Y.&kjhiPkjiPkji) 

h=lj=0i=l  1=1 

with  [Ekj]u  = 

•  The  VB  lower  bound  for  c\  may  be  derived  as 

K 

Ha)  =  logA/'(0,i)(ci;0, 1)  +  JUogA/"  (Bfe;  0,  )  +  constant.  (39) 

k= l 

The  point  estimate  for  c\  can  be  updated  at  each  VB  iteration  by  maximizing 
the  lower  bound  T(c\). 

•  Since  point  estimate  of  is  employed  as  each  VB  iteration,  the  variational 
distribution  for  d0  may  be  the  same  with  (30) 

q(d0 )  =  Gamma  (d0;  a0,  bo'j  , 

K  d 

where  a0  =  a0  +  0.5dKT  and  b0  =  b0  +  0.5  X)  zG  hiA^T^lnA^.. 

k= 1 j= 1 

•  Similarly,  the  lower  bound  for  d\  is 

K  d 

Hdi)  =  logA/(o,i)(di;  0, 1)  +  EE  logA/"  (A kj  \  0,  Tkj)  +  constant 

k= 1  j= i 

and  the  point  esitimation  for  di  is  obtained  by  maximizing  J-’(di). 

By  following  (33)-(41),  the  model  parameters  and  GP  hyperparameters  can  be 
updated  iteratively  until  convergence.  In  our  experiment,  we  observed  fast  conver¬ 
gence;  typically  the  relative  change  of  the  lower  bound  reduces  to  10-4  within  100 
iterations. 


(40) 


(41) 
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